Introduction

We have now preprocessed and merged our single cell data. The next step is to analyse the data for cell type identification, identification of marker genes and et cetera. We will focus on some simple steps of downstream analysis for single cell RNA-sequencing (scRNA-seq) data as shown below.

  1. What are the cell types present in the dataset?
  2. What are these cell clusters?
  3. For a gene of interest, how can I visualise the gene expression distribution?
  4. What are the cell type composition in the data?
# This section is to install package that will be required to proceed with this section of the workshop.

# If you do not have "devtools" installed, uncomment the following line and run on your console
# install.packages("devtools")

# devtools::install_github("SydneyBioX/scdney")

We will be using some of the functions we developed in our scdney package. You may visit our package website for the vignette and further details about scdney.

# We will subset Su et al. and Yang et al. datasets.
ids = which(colData(sce_scMerge)$batch %in% c("GSE87795", "GSE90047"))

lab = colData(sce_scMerge)$cellTypes[ids]

nCs = length(table(lab))

mat = SummarizedExperiment::assay(sce_scMerge, "scMerge")[,ids]

Q1. What are the cell types present in the dataset?

The data from single cell RNA-sequencing experiment does not provide cell type information for individual cells. We need to identify cell types of indivdual cells in our data with bioinformatics analysis.

Before we identify the cell types in the dataset, we need to first identify how many distinct group of population we can find from our data. One common method to achieve this is by clustering. Clustering is a type of machine learning technique to group similar samples (cell) to the same group and partition samples that are different by comparing their feature information (gene expression).

To optimise clustering method, algorithm and similarity metric are two key components that affects clustering performance.

In our recent study, we found pearson correlation to be optimal similarity metric for comparing single cell RNA-seq data ( Kim et al., 2018 ). Therefore in this workshop, we will utilise scClust in our scdney package, which implemented 2017 Nature methods clustering algorithm SIMLR with pearson correlation.

How many distinct group of population are there in my dataset?

Typical clustering methods (except for some methods like hierarchical clustering) require users to specify number of distinct groups (k) to cluster from your data. In an unsupervised setting, we do not know the exact number and therefore we will run clustering for various number of k.

# We will not run for various `k` to save time. Instead, we will load pre-computed results for `k` between 3 to 8

# This is an easy way to run `scClust` for k = (3,4,5,6,7,8).
# all_k = 3:8
# simlr_results = sapply(as.character(all_k), function(k) {
#   scClust(mat, as.numeric(k), similarity = "pearson", method = "simlr", seed = 1, cores.ratio = 0, geneFilter = 0)
# }, USE.NAMES = TRUE, simplify = FALSE)

# For demonstration purpose, we will run k = 6 (which is actually the number of cell types in our dataset)
simlr_result_k6 = scClust(mat, 6, similarity = "pearson", method = "simlr", seed = 1, cores.ratio = 0, geneFilter = 0)
## Computing the multiple Kernels.
## Performing network diffiusion.
## Iteration:  1 
## Iteration:  2 
## Iteration:  3 
## Iteration:  4 
## Iteration:  5 
## Iteration:  6 
## Iteration:  7 
## Iteration:  8 
## Iteration:  9 
## Iteration:  10 
## Performing t-SNE.
## Epoch: Iteration # 100  error is:  0.1527458 
## Epoch: Iteration # 200  error is:  0.1218641 
## Epoch: Iteration # 300  error is:  0.1112158 
## Epoch: Iteration # 400  error is:  0.1059176 
## Epoch: Iteration # 500  error is:  0.1026344 
## Epoch: Iteration # 600  error is:  0.1003277 
## Epoch: Iteration # 700  error is:  0.0985936 
## Epoch: Iteration # 800  error is:  0.09722343 
## Epoch: Iteration # 900  error is:  0.09610318 
## Epoch: Iteration # 1000  error is:  0.09516611 
## Performing Kmeans.
## Performing t-SNE.
## Epoch: Iteration # 100  error is:  11.56649 
## Epoch: Iteration # 200  error is:  0.2845774 
## Epoch: Iteration # 300  error is:  0.2560054 
## Epoch: Iteration # 400  error is:  0.2410104 
## Epoch: Iteration # 500  error is:  0.2361371 
## Epoch: Iteration # 600  error is:  0.2331395 
## Epoch: Iteration # 700  error is:  0.231009 
## Epoch: Iteration # 800  error is:  0.229431 
## Epoch: Iteration # 900  error is:  0.2281798 
## Epoch: Iteration # 1000  error is:  0.2271669
# load("~/Dropbox (Sydney Uni)/workdir/singleCell/workshop/data/simlr_result_k6.RData")

load(paste(getwd(), "/data/simlr.results.RData", sep = ""))

How do we select optimal k?

If our k is correct, we expect that the each clusters are closely packed together and the distance between the clusters, within-cluster sum of squares (WSS) are expected to be small. Thus, we will select the k with a small total WSS (compact clusters). This is called the “Elbow” method.

# Find total WSS from all cluster outputs
all_wss = sapply(simlr_results, function(result) {
  sum(result$y$withinss)
}, USE.NAMES = TRUE, simplify = TRUE)

plt.dat = data.frame(
  k = names(all_wss),
  total_wss = all_wss
)

ggplot(plt.dat, aes(x = as.numeric(as.character(k)), y = total_wss)) +
  geom_point(size = 3) +
  stat_smooth(method = loess, col = "red", method.args = list(degree = 1)) +
  ylab("Total WSS") +
  ggtitle("Compare Total WSS for each 'k'")

As shown in this plot, the graph begin to plateau from k = 5. We can estimate that k is around 5 or 6. We can further investigate using silhouette scores or other metrics but using our t-SNE plot (and with a little cheating) we will estimate k = 6.

Quick quiz

  1. We can determine k with compactness of the clusters. What other measure can we use to determine k?
  2. How would you determine optimal number of clusters for hierarchical clustering?

For the purpose of demonstration, we would like to highlight the effect of similarity metric to your cluster output.

# To run scClust with euclidean distance, uncommnet the following lines.
# simlr_result_eucl_k6 = scClust(mat, 6, similarity = "euclidean", method = "simlr", seed = 1, cores.ratio = 0, geneFilter = 0)

# for convenience, we will load our pre-computed result
load(paste(getwd(), "/data/simlr_result_eucl_k6.RData", sep = ""))
# create tsne object
set.seed(123)
tsne_result = Rtsne(t(mat), check_duplicates = FALSE)

#################################################
tmp_lab = as.numeric(factor(lab))
pear_cluster = mapvalues(
  simlr_result_k6$y$cluster,
  from = c(1,2,3,4,5,6),
  to = c(2,3,1,4,6,5)
)
eucl_cluster = mapvalues(
  simlr_result_eucl_k6$y$cluster,
  from = c(1,2,3,4,5,6),
  to = c(6,2,4,3,1,5)
)
#################################################

plt.dat = data.frame(
  tsne1 = rep(tsne_result$Y[,1], 3),
  tsne2 = rep(tsne_result$Y[,2], 3),
  cluster = factor(c(tmp_lab, pear_cluster, eucl_cluster)),
  label = factor(c(rep("Truth", length(lab)), rep("Pearson", length(lab)), rep("Euclidean", length(lab))), levels = c("Truth", "Pearson", "Euclidean"))
)

ggplot(plt.dat, aes(x = tsne1, y = tsne2, colour = cluster, colors)) +
  geom_point(size = 2) +
  ggtitle("t-SNE plot") +
  facet_grid(cols=vars(label)) + 
  theme(legend.position = "none")

Quick quiz

  1. From your observation of the above t-SNE plot, did our clustering method (with Pearson correlation) group cells well?

[Optional] We can evaluate the clustering performance using our ground truth.

# ARI
ari = c(mclust::adjustedRandIndex(lab, simlr_result_eucl_k6$y$cluster), mclust::adjustedRandIndex(lab, simlr_result_k6$y$cluster))

# NMI
nmi = c(igraph::compare(as.numeric(factor(lab)), simlr_result_eucl_k6$y$cluster, method = "nmi"), igraph::compare(as.numeric(factor(lab)), simlr_result_k6$y$cluster, method = "nmi"))

plt.dat = data.frame(
  dist = rep(c("Euclidean", "Pearson"), 2),
  value = c(ari, nmi),
  eval = rep(c("ARI", "NMI"), each = 2)
)

ggplot(plt.dat, aes(x = dist, y = value, fill = dist)) + 
  geom_bar(stat="identity") + 
  facet_grid(col = vars(eval)) +
  labs(x = "Similarity metrics", y = "Evalution score", title = "Affect of similarity metrics in scRNA-seq data")

Q2. What are these cell clusters?

We have grouped all the cells to 6 distinct groups. We now want to identify what these groups are, i.e. what cell types are there in my dataset? Thus, what defines a cell type?

We can use marker genes to identify cell types.

2.1 What are the marker genes that distinguish the different cell types?

Here we provide a function that allow one to find differentially expressed gene between a cluster and the remaining clusters. The input of this function is the cluster id. The output is a list of gene and its associated p - value.

# This function is specific for scClust ouptut object that ran with "SIMLR"
findmarker <- function(mat, scClust_obj, cluster_id){

  #group 1 is the cluster of interest
  group1 = which(scClust_obj$y$cluster == cluster_id)  
  group1 = colnames(mat)[group1]  #find out which cell belongs to group 1
  
  group_label <- ifelse(colnames(mat) %in% group1, "group1", "group2")  #if the cell is in group 1 ,create a label called group 1, else create a label called group 2
  
  #create a dataframe , with cell name and the group that this cell belongs to 
  group_label <- data.frame(group_label, colnames(mat))
  rownames(group_label) <- group_label[,2]
  colnames(group_label) <- c("group","wellKey")
  #MAST requires a wellkey (which essentially is the name of the cell)
  
  
  #create a dataframe, with gene name 
  gene_label <- data.frame(rownames( mat))
  colnames(gene_label) <- "primerid"
  rownames(gene_label) <- gene_label[, 1]
  #MAST requires a primerid (which essentially is the gene name)
  
   #create a MAST object, with the group id of each cell and the gene name 
  sca <- MAST::FromMatrix(
      exprsArray = mat,
      cData = group_label,
      fData = gene_label
  )

  #this is the differential expression model (called a hurdle model)
  #(see https://www.bioconductor.org/packages/release/bioc/vignettes/MAST/inst/doc/MAITAnalysis.html#4_differential_expression_using_a_hurdle_model)
  zlmCond <- MAST::zlm(~group, sca) 
  summaryCond <- summary(object = zlmCond, doLRT = 'groupgroup2') 
  
  #now extract the information 
  summaryDT <- data.frame(summaryCond$datatable)
  return_val <-data.frame(summaryDT[summaryDT[, "component"] == "H", 1], summaryDT[summaryDT[, "component"] == "H", 4])
  #We select the "H" rows, the "H" means we want hurdle P values
  #extract both the gene name and the P value associated with this gene
  colnames(return_val) <- c("gene", "P_value")
  
  #order by the p value , from the most significant 
  return_val <- return_val[order(return_val$P_value), ]
  
  return (return_val)

}

Here we provide an example of using the findmarker gene function.
To find out the marker gene in cluster 4, we type in 4 in the findmarker function. We then look at the top 10 genes ranked by p-value. We can then use ggplot to visualise the distribution of one of the genes across the dataset.

df<- data.frame(tsne_result$Y)
df<- cbind(df, as.factor(simlr_result_k6$y$cluster) )

df = df %>%
  dplyr::rename(
    tsne1 = X1,
    tsne2 = X2,
    cluster = `as.factor(simlr_result_k6$y$cluster)`
  )


marker <- findmarker(mat, simlr_result_k6, 4)
## Assuming data assay in position 1, with name et is log-transformed.
## 
## Done!
## Combining coefficients and standard errors
## Calculating log-fold changes
## Calculating likelihood ratio tests
## Refitting on reduced model...
## 
## Done!
marker[1:10,]
##          gene       P_value
## 12739    Gys2 6.068095e-153
## 19151    Shbg 2.513240e-134
## 12924     Hgd 1.577501e-126
## 19364 Slc27a2 8.229521e-124
## 6096     Gcgr 7.663681e-122
## 16209     Otc 1.338106e-115
## 12842      Hc 1.016114e-113
## 15057  Mogat2 1.509759e-113
## 12729    Gulo 6.801648e-113
## 14081   Lect2 1.976735e-112
index <- which(rownames(sce_scMerge[,ids]) == "Gys2")
# df_Gys2 <- cbind(df, mat[index,] )
df_Gys2 = df
df_Gys2$Gys2 = mat[index,]

ggplot(data =df_Gys2, mapping = aes(x=tsne1, y = tsne2, colour = Gys2) ) +
  geom_point(alpha=0.5) +
  scale_color_viridis() +
  theme_bw() +
  labs(col="Gys2 expression", x = "tsne1", y = "tsne2")

Quick quiz

  1. What does the above t-SNE plot tell you?

Here we repeat the analysis as above, but for cluster 3. See if you can understand the output.

marker <- findmarker(mat, simlr_result_k6, 3)
## Assuming data assay in position 1, with name et is log-transformed.
## 
## Done!
## Combining coefficients and standard errors
## Calculating log-fold changes
## Calculating likelihood ratio tests
## Refitting on reduced model...
## 
## Done!
marker[1:10,]
##                gene       P_value
## 88    1700011H14Rik 1.187064e-210
## 5220         Erich5 8.248276e-200
## 1257       Adamts16 4.544699e-193
## 3539          Cldn3 2.663833e-177
## 20941        Tmem27 2.406824e-172
## 11554         Gm609 7.277614e-164
## 14560      Marveld3 3.901531e-152
## 21906          Vnn1 5.224090e-147
## 13356        Igfbp7 5.040702e-134
## 19960          Spp1 8.186524e-133
index <- which(rownames(sce_scMerge[,ids]) == "Erich5")
# df_erich<- cbind(df, mat[index,] )
df_erich <- df
df_erich$erich = mat[index,]

ggplot(data =df_erich, mapping = aes(x=tsne1, y = tsne2, colour = erich) ) +
  geom_point(alpha=0.5)+scale_color_viridis()+
  theme_bw()+
  labs(col="Erich5 expression")

Q3. For a gene of interest, how can I visualise the gene expression distribution?

In Q2, we have identified some interesting marker genes from our dataset. If we have a gene that we know, and we want to identify the its expression pattern in our dataset, we can also visualise the distribution. For example, Hnf4a has been stated in literature as a marker gene for hepatoblast cell.

The figure on the left highlight cluster 4 and the figure on the right highlight the expression of Hnf4a.

This suggests that cluster 4 could belong to hepatoblast cell.

fig1 <- ggplot(data = df, mapping = aes(x=tsne1, y = tsne2) ) + 
  geom_point(aes(color = ifelse(cluster == 4, 'Yellow', 'Purple') ) ,alpha=0.4) + scale_colour_viridis_d() + 
  ggtitle("Cluster 4") +
  xlab("") + 
  ylab("") + 
  labs(colour = "") +
  theme(legend.title=element_blank()) +
  theme_bw()+ guides(color=FALSE) 

index <- which(rownames(sce_scMerge[,ids]) == "Hnf4a")
df_hnf4a <- df
df_hnf4a$hnf4a = mat[index,]

fig2 <- ggplot(data =df_hnf4a, mapping = aes(x=tsne1, y = tsne2, colour = hnf4a) ) +
  geom_point(alpha=0.5) +
  scale_color_viridis() +
  theme_bw()+labs(col="expression") +
  ggtitle("Hnf4a expression pattern")

ggarrange(fig1,fig2, ncol= 2, nrow = 1)

See if you can understand the output of the following code.

fig1<- ggplot(data =df, mapping = aes(x=tsne1, y = tsne2) ) + 
  geom_point(aes(color = ifelse(cluster == 3, 'Yellow', 'Purple') ) ,alpha=0.4) +
  scale_colour_viridis_d() + 
  ggtitle("Cluster 3") +
  labs(colour = "")+
  theme(legend.title=element_blank())+
  theme_bw()+ 
  guides(color=FALSE)


index <- which(rownames(sce_scMerge[,ids]) == "Epcam")
df_epcam<- cbind(df, mat[index,] )

fig2 <- ggplot(data =df_epcam, mapping = aes(x=tsne1, y = tsne2, colour = `mat[index, ]`) ) +
  geom_point(alpha=0.5)+
  scale_color_viridis()+
  theme_bw()+
  labs(col="expression")+
  ggtitle("Epcam expression pattern")

ggarrange(fig1,fig2, ncol= 2, nrow = 1)

As you may have already noticed from above steps, unsupervised clustering methods alone are not perfect in capturing cell type information. Using this step iteratively, we need to refine our cell type information.

Q4. What are the cell type composition in the data?

NOTE: From this step, we will assume we have correctly refined our cell type information from above steps and we will use our “known” cell type information.

Cell type proportions

plt.dat = data.frame(
  table(lab)
)

ggplot(plt.dat, aes(x = lab, y = Freq, fill = lab)) +
  geom_bar(stat = "identity") +
  labs(x = "Cell types", y = "Frequency", title = "Composition of cell types")

We observe that hepatoblast/hepatocyte is the largest population.

[Extension]. Any rare cell type? What is the most common cell type?

SessionInfo

sessionInfo()
## R version 3.6.0 (2019-04-26)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: macOS Mojave 10.14.5
## 
## Matrix products: default
## BLAS:   /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.6/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
## 
## attached base packages:
## [1] parallel  stats4    stats     graphics  grDevices utils     datasets 
## [8] methods   base     
## 
## other attached packages:
##  [1] plyr_1.8.4                  ggpubr_0.2                 
##  [3] magrittr_1.5                viridis_0.5.1              
##  [5] viridisLite_0.3.0           MAST_1.10.0                
##  [7] ggplot2_3.1.1               cluster_2.0.9              
##  [9] Rtsne_0.15                  mclust_5.4.3               
## [11] scdney_0.1.2                edgeR_3.26.4               
## [13] limma_3.40.2                dplyr_0.8.1                
## [15] SingleCellExperiment_1.6.0  SummarizedExperiment_1.14.0
## [17] DelayedArray_0.10.0         BiocParallel_1.18.0        
## [19] matrixStats_0.54.0          Biobase_2.44.0             
## [21] GenomicRanges_1.36.0        GenomeInfoDb_1.20.0        
## [23] IRanges_2.18.1              S4Vectors_0.22.0           
## [25] BiocGenerics_0.30.0        
## 
## loaded via a namespace (and not attached):
##   [1] amap_0.8-17            minqa_1.2.4            colorspace_1.4-1      
##   [4] class_7.3-15           ggridges_0.5.1         htmlTable_1.13.1      
##   [7] XVector_0.24.0         base64enc_0.1-3        rstudioapi_0.10       
##  [10] mice_3.5.0             prodlim_2018.04.18     manipulate_1.0.1      
##  [13] mvtnorm_1.0-10         lubridate_1.7.4        codetools_0.2-16      
##  [16] splines_3.6.0          doParallel_1.0.14      knitr_1.23            
##  [19] Formula_1.2-3          nloptr_1.2.1           caret_6.0-84          
##  [22] clusteval_0.1          broom_0.5.2            compiler_3.6.0        
##  [25] backports_1.1.4        assertthat_0.2.1       Matrix_1.2-17         
##  [28] lazyeval_0.2.2         acepack_1.4.1          htmltools_0.3.6       
##  [31] tools_3.6.0            igraph_1.2.4.1         gtable_0.3.0          
##  [34] glue_1.3.1             GenomeInfoDbData_1.2.1 reshape2_1.4.3        
##  [37] Rcpp_1.0.1             nlme_3.1-140           iterators_1.0.10      
##  [40] timeDate_3043.102      xfun_0.7               gower_0.2.1           
##  [43] stringr_1.4.0          lme4_1.1-21            dendextend_1.12.0     
##  [46] pan_1.6                zlibbioc_1.30.0        MASS_7.3-51.4         
##  [49] scales_1.0.0           ipred_0.9-9            doSNOW_1.0.16         
##  [52] expm_0.999-4           RColorBrewer_1.1-2     yaml_2.2.0            
##  [55] gridExtra_2.3          rpart_4.1-15           latticeExtra_0.6-28   
##  [58] stringi_1.4.3          randomForest_4.6-14    foreach_1.4.4         
##  [61] blme_1.0-4             e1071_1.7-2            checkmate_1.9.3       
##  [64] boot_1.3-22            lava_1.6.5             rlang_0.3.4           
##  [67] pkgconfig_2.0.2        bitops_1.0-6           evaluate_0.14         
##  [70] lattice_0.20-38        purrr_0.3.2            labeling_0.3          
##  [73] recipes_0.1.5          htmlwidgets_1.3        cowplot_0.9.4         
##  [76] tidyselect_0.2.5       R6_2.4.0               snow_0.4-3            
##  [79] DescTools_0.99.28      generics_0.0.2         Hmisc_4.2-0           
##  [82] mitml_0.3-7            pillar_1.4.1           foreign_0.8-71        
##  [85] withr_2.1.2            abind_1.4-5            survival_2.44-1.1     
##  [88] RCurl_1.95-4.12        nnet_7.3-12            tibble_2.1.2          
##  [91] crayon_1.3.4           jomo_2.6-8             rmarkdown_1.13        
##  [94] locfit_1.5-9.1         grid_3.6.0             data.table_1.12.2     
##  [97] ModelMetrics_1.2.2     digest_0.6.19          tidyr_0.8.3           
## [100] munsell_0.5.0